Derivatives of Logarithmic Stationary Distributions for Policy Gradient Reinforcement Learning
Authors
Abstract
Most conventional policy gradient reinforcement learning (PGRL) algorithms neglect (or do not explicitly make use of) a term in the average reward gradient with respect to the policy parameter. That term involves the derivative of the stationary state distribution, which corresponds to the sensitivity of the distribution to changes in the policy parameter. Although the bias introduced by omitting this term can be reduced by setting the forgetting rate gamma of the value functions close to 1, these algorithms do not permit gamma to be set exactly at gamma = 1. In this article, we propose a method for estimating the log stationary state distribution derivative (LSD), a useful form of the derivative of the stationary state distribution, through a backward Markov chain formulation and a temporal difference learning framework. We also propose a new policy gradient (PG) framework with the LSD, in which the average reward gradient can be estimated by setting gamma = 0, so that learning the value functions becomes unnecessary. We test the performance of the proposed algorithms on simple benchmark tasks and show that they can improve the performance of existing PG methods.
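To make the neglected term concrete, here is a minimal sketch of the decomposition behind the LSD approach, stated in the average-reward setting with assumed notation (d^π for the stationary state distribution, π_θ for the parameterized policy, r(s, a) for the reward; these symbols are ours, not quoted from the paper). Applying the product rule together with the log-derivative trick splits the average reward gradient into the LSD term and the familiar policy score term:

% Average reward under the stationary state distribution d^{\pi}
\eta(\theta) = \sum_{s} d^{\pi}(s) \sum_{a} \pi_{\theta}(a \mid s)\, r(s, a)

% Product rule + log-derivative trick; the \nabla_{\theta} \log d^{\pi}(s)
% factor is the LSD term that conventional PGRL algorithms neglect
\nabla_{\theta} \eta(\theta) = \sum_{s, a} d^{\pi}(s)\, \pi_{\theta}(a \mid s) \left[ \nabla_{\theta} \log d^{\pi}(s) + \nabla_{\theta} \log \pi_{\theta}(a \mid s) \right] r(s, a)

In this form the reward enters only as an immediate factor, which is consistent with the abstract's claim that, given an LSD estimate, the gradient can be computed at gamma = 0 without learning value functions.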
Similar articles
Multi-Agent Learning with Policy Prediction
Due to the non-stationary environment, learning in multi-agent systems is a challenging problem. This paper first introduces a new gradient-based learning algorithm, augmenting the basic gradient ascent approach with policy prediction. We prove that this augmentation results in a stronger notion of convergence than the basic gradient ascent, that is, strategies converge to a Nash equilibrium wi...
Scalable Learning in Stochastic Games
Stochastic games are a general model of interaction between multiple agents. They have recently been the focus of a great deal of research in reinforcement learning as they are both descriptive and have a well-defined Nash equilibrium solution. Most of this recent work, although very general, has only been applied to small games with at most hundreds of states. On the other hand, there are land...
Stable Dynamic Programming and Reinforcement Learning with Dual Representations
We investigate novel, dual algorithms for dynamic programming and reinforcement learning, based on maintaining explicit representations of stationary distributions instead of value functions. In particular, we investigate the convergence properties of standard dynamic programming and reinforcement learning algorithms when they are converted to their natural dual form. Here we uncover advantages...
Signal-to-Noise Ratio Analysis of Policy Gradient Algorithms
Policy gradient (PG) reinforcement learning algorithms have strong (local) convergence guarantees, but their learning performance is typically limited by a large variance in the estimate of the gradient. In this paper, we formulate the variance reduction problem by describing a signal-to-noise ratio (SNR) for policy gradient algorithms, and evaluate this SNR carefully for the popular Weight Per...
Reinforcement Learning for Adaptive Theory of Mind in the Sigma Cognitive Architecture
One of the most common applications of human intelligence is social interaction, where people must make effective decisions despite uncertainty about the potential behavior of others around them. Reinforcement learning (RL) provides one method for agents to acquire knowledge about such interactions. We investigate different methods of multiagent reinforcement learning within the Sigma cognitive...
Journal: Neural Computation
Volume 22, Issue 2
Pages: -
Year of publication: 2010